UPPC - Urdu Paraphrase Plagiarism Corpus
Abstract
Paraphrase plagiarism is a significant and widespread problem, and research shows that it is hard to detect. Several methods and automatic systems have been proposed to deal with it. However, evaluating and comparing such solutions is not possible because no benchmark corpora with manual examples of paraphrase plagiarism are available. To address this issue, we present the development of a paraphrase plagiarism corpus containing simulated (manually created) examples in Urdu, a language widely spoken around the world. This resource is the first of its kind developed for the Urdu language, and we believe that it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems.
Related Resources
Plagiarism Meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection
Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little attention has been paid to its analysis in the framework of automatic plagiarism detection. Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase plagiarism. In this article, we analyze the relationship between paraphrasing and plagiarism, paying special attentio...
PAN 2015 Shared Task on Plagiarism Detection: Evaluation of Corpora for Text Alignment: Notebook for PAN at CLEF 2015
In this paper we describe and evaluate the corpora submitted to the PAN 2015 shared task on plagiarism detection for text alignment. We received mono- and cross-language corpora in the following languages (pairs): English, Persian, Chinese, and Urdu-English, English-Persian. We present an independent section for each submitted corpus including statistics, discussion of the obfuscation techniques ...
Re-examining Machine Translation Metrics for Paraphrase Identification
We propose to re-examine the hypothesis that automated metrics developed for MT evaluation can prove useful for paraphrase identification in light of the significant work on the development of new MT metrics over the last 4 years. We show that a meta-classifier trained using nothing but recent MT metrics outperforms all previous paraphrase identification approaches on the Microsoft Research Par...
Developing Monolingual English Corpus for Plagiarism Detection using Human Annotated Paraphrase Corpus
In this paper, we describe an approach to creating a monolingual English plagiarism detection corpus for the text alignment corpus construction task in the PAN 2015 competition. We propose two different methods of fragment obfuscation for creating the cases of plagiarism. The first method is an artificial obfuscation which consists of a variety of obfuscation strategies such as synonym sub...
Cross-Language Urdu-English (CLUE) Text Alignment Corpus: Notebook for PAN at CLEF 2015
Plagiarism is a well-known problem today. Easy access to print and electronic media and to ready-to-use material has made it easy to reuse existing text in a new document. The severity of the problem has been much reduced in the monolingual context by the automated and tailored efforts of the research community, but the issue is not yet properly addressed in cross-language (CL) text reuse. Any story or ...